Conversation

Contributor

@CTTY CTTY commented Oct 10, 2025

Which issue does this PR close?

What changes are included in this PR?

New:

  • Added new partitioning module with PartitioningWriter trait
  • ClusteredWriter: Optimized for pre-sorted data, requires writing in partition order
  • FanoutWriter: Flexible writer that can handle data from any partition at any time
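The difference between the two writers can be sketched roughly as below. This is a simplified, synchronous illustration and not the code from this PR: the trait shape, the `String` partition key, the `i64` rows, and the `write_all` helper are all assumptions made for the example.

```rust
use std::collections::HashMap;

// Hypothetical stand-in: the real PartitionKey type carries more than a string.
type PartitionKey = String;

/// Hypothetical, simplified trait: route each row to the writer for its partition.
trait PartitioningWriter {
    fn write(&mut self, key: PartitionKey, row: i64) -> Result<(), String>;
    /// Finish writing and return the rows collected per partition.
    fn close(self) -> HashMap<PartitionKey, Vec<i64>>;
}

/// Keeps one open buffer per partition, so rows may arrive in any order.
#[derive(Default)]
struct FanoutWriter {
    open: HashMap<PartitionKey, Vec<i64>>,
}

impl PartitioningWriter for FanoutWriter {
    fn write(&mut self, key: PartitionKey, row: i64) -> Result<(), String> {
        self.open.entry(key).or_default().push(row);
        Ok(())
    }

    fn close(self) -> HashMap<PartitionKey, Vec<i64>> {
        self.open
    }
}

/// Keeps only the current partition open and rejects rows for a partition
/// that has already been closed, so input must be grouped by partition.
#[derive(Default)]
struct ClusteredWriter {
    current: Option<(PartitionKey, Vec<i64>)>,
    finished: HashMap<PartitionKey, Vec<i64>>,
}

impl PartitioningWriter for ClusteredWriter {
    fn write(&mut self, key: PartitionKey, row: i64) -> Result<(), String> {
        if self.finished.contains_key(&key) {
            return Err(format!("partition {key} was already closed"));
        }
        let same = matches!(&self.current, Some((cur, _)) if *cur == key);
        if !same {
            // Moving to a new partition: close the previous one first.
            if let Some((done_key, done_rows)) = self.current.take() {
                self.finished.insert(done_key, done_rows);
            }
            self.current = Some((key, Vec::new()));
        }
        if let Some((_, rows)) = self.current.as_mut() {
            rows.push(row);
        }
        Ok(())
    }

    fn close(mut self) -> HashMap<PartitionKey, Vec<i64>> {
        if let Some((key, rows)) = self.current.take() {
            self.finished.insert(key, rows);
        }
        self.finished
    }
}

/// Hypothetical helper: feed all rows to a writer, then close it.
fn write_all<W: PartitioningWriter>(
    mut writer: W,
    rows: &[(&str, i64)],
) -> Result<HashMap<PartitionKey, Vec<i64>>, String> {
    for (key, row) in rows {
        writer.write(key.to_string(), *row)?;
    }
    Ok(writer.close())
}
```

In this sketch the fanout writer trades memory (one open buffer per partition) for flexibility, while the clustered writer holds a single open partition at a time and fails on out-of-order input.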

Modification:

  • (BREAKING) Modified DataFileWriterBuilder to support dynamic partition assignment
  • Updated DataFusion integration to use the new writer API

Are these changes tested?

Added unit tests

/// Build the iceberg writer.
async fn build(self) -> Result<Self::R>;
/// Build the iceberg writer for an optional partition key.
async fn build_with_partition(self, partition_key: Option<PartitionKey>) -> Result<Self::R>;
Contributor Author

@CTTY CTTY Oct 10, 2025

This is a breaking change. I believe this is necessary because:

  1. An IcebergWriter is supposed to generate DataFiles that always hold a partition value, according to the Iceberg spec.

  2. The existing code stores the partition value directly in the builder, which makes builder.clone() useless:

let builder = IcebergWriterBuilder::new(partition_A);
let writer_A = builder.build();
... // write to partition A

// done with partition A and now we need to write to partition B
// this is wrong because partition value A is still stored in the builder
let writer_B = builder.clone().build();

An alternative is to add a new method, clone_with_partition(), but that would also be a breaking change, and it is less clean than build_with_partition().
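A minimal sketch of the idea behind build_with_partition(), assuming a builder that holds only partition-independent settings. The types here are simplified stand-ins: the real trait method is async and returns a Result, and the `target_file_size` field and the `writers_for` helper are hypothetical.

```rust
// Hypothetical stand-in: the real PartitionKey type carries more than a string.
type PartitionKey = String;

#[derive(Debug, PartialEq)]
struct DataFileWriter {
    target_file_size: usize,
    partition_key: Option<PartitionKey>,
}

/// The builder keeps only settings shared by every partition.
#[derive(Clone)]
struct DataFileWriterBuilder {
    target_file_size: usize,
}

impl DataFileWriterBuilder {
    fn new(target_file_size: usize) -> Self {
        Self { target_file_size }
    }

    /// The partition key is supplied at build time instead of living in the builder,
    /// so a cloned builder never carries a stale partition value.
    fn build_with_partition(self, partition_key: Option<PartitionKey>) -> DataFileWriter {
        DataFileWriter {
            target_file_size: self.target_file_size,
            partition_key,
        }
    }
}

/// Hypothetical helper: create one writer per partition from a single builder.
fn writers_for(builder: &DataFileWriterBuilder, keys: &[&str]) -> Vec<DataFileWriter> {
    keys.iter()
        .map(|key| builder.clone().build_with_partition(Some(key.to_string())))
        .collect()
}
```

With this shape, cloning the builder is meaningful again, because each clone can be built for a different partition.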

Contributor

I'm fine with this change, but I'd like one further change, as follows:

async fn build(&self, partition_key: Option<PartitionKey>) -> Result<Self::R>

If the builder can be reused to create the actual IcebergWriter, I'd like to avoid cloning it.

Contributor

Hi, this has not been changed yet.

Contributor Author

@CTTY CTTY Oct 16, 2025

Sorry, I thought I had replied earlier, but it seems my comment didn't go through.

I think your advice makes sense, but it would be better to fix this in a separate PR.

Changing IcebergWriterBuilder::build to take &self means we would need to change every writer builder along the writer chain to use &self (IcebergWriterBuilder -> RollingFileWriterBuilder -> FileWriterBuilder). Otherwise, the builder would look like the following, which wouldn't help much since we would still need to clone the inner builder. It could also confuse users further, since each writer builder would have different semantics.

    async fn build(&self, partition_key: Option<PartitionKey>) -> Result<Self::R> {
        Ok(DataFileWriter {
            inner: Some(self.inner.clone().build()), // inner builder still needs to be cloned
            partition_key,
        })
    }

Please let me know your thoughts.
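For comparison, here is a rough sketch of what the chain might look like if every builder took &self, so that no builder needs to be cloned at any level. It is an illustration only: the methods are synchronous, and the `compression` and `target_file_size` fields are hypothetical placeholders for whatever settings each builder actually holds.

```rust
// Hypothetical stand-in: the real PartitionKey type carries more than a string.
type PartitionKey = String;

struct FileWriter {
    compression: String,
}

struct RollingFileWriter {
    inner: FileWriter,
    target_file_size: usize,
}

struct DataFileWriter {
    inner: RollingFileWriter,
    partition_key: Option<PartitionKey>,
}

struct FileWriterBuilder {
    compression: String,
}

impl FileWriterBuilder {
    /// Innermost builder: copies its settings into a fresh writer.
    fn build(&self) -> FileWriter {
        FileWriter {
            compression: self.compression.clone(),
        }
    }
}

struct RollingFileWriterBuilder {
    inner: FileWriterBuilder,
    target_file_size: usize,
}

impl RollingFileWriterBuilder {
    /// Borrows the inner builder instead of cloning and consuming it.
    fn build(&self) -> RollingFileWriter {
        RollingFileWriter {
            inner: self.inner.build(),
            target_file_size: self.target_file_size,
        }
    }
}

struct DataFileWriterBuilder {
    inner: RollingFileWriterBuilder,
}

impl DataFileWriterBuilder {
    /// Outermost builder: reusable for any number of partitions.
    fn build(&self, partition_key: Option<PartitionKey>) -> DataFileWriter {
        DataFileWriter {
            inner: self.inner.build(),
            partition_key,
        }
    }
}
```

Individual settings are still copied into each writer, but the builders themselves are only borrowed, which is the consistency across the chain that the comment above argues for.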

Contributor

Thanks for the explanation, that makes sense to me. It would be better to create an issue to track it.

Contributor Author

Created a tracking issue here: #1753

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this PR! I just finished the first round of review, and I think we are on the right track!

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this PR!

@liurenjie1024 liurenjie1024 merged commit 856597b into apache:main Oct 17, 2025
16 checks passed
@CTTY CTTY deleted the ctty/parpar-new branch October 17, 2025 17:32
Development

Successfully merging this pull request may close these issues.

Implement fanout partitioned data writer.

2 participants